{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "79a0187c",
   "metadata": {},
   "source": [
    "# Lesson 5: Zero-Shot Audio Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49adf368",
   "metadata": {},
   "source": [
    "- In the classroom, the libraries have already been installed for you.\n",
    "- If you are running this code on your own machine, please install the following:\n",
    "``` \n",
    "    !pip install transformers\n",
    "    !pip install datasets\n",
    "    !pip install soundfile\n",
    "    !pip install librosa\n",
    "```\n",
    "\n",
    "The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed. \n",
    "- This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2a19033-a44c-48c0-b627-6f31512bbf32",
   "metadata": {},
   "source": [
    "- Here is some code that suppresses warning messages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b641dcda-9522-4dbc-b9b9-388c5a061748",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "from transformers.utils import logging\n",
    "logging.set_verbosity_error()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "673a28d3",
   "metadata": {},
   "source": [
    "### Prepare the dataset of audio recordings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d422f6b-d6ec-4f54-9022-6268c89a071c",
   "metadata": {
    "height": 115
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset, load_from_disk\n",
    "\n",
    "# This dataset is a collection of different sounds of 5 seconds\n",
    "# dataset = load_dataset(\"ashraq/esc50\",\n",
    "#                       split=\"train[0:10]\")\n",
    "dataset = load_from_disk(\"./models/ashraq/esc50/train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0afa364c-3ef9-4528-aa80-2ee61f96df55",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio_sample = dataset[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3175cab3-61b5-41dd-bf0d-843322e79ffe",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio_sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4afeafff-182b-4fe4-9ab6-4f3a3acc80a0",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "from IPython.display import Audio as IPythonAudio\n",
    "IPythonAudio(audio_sample[\"audio\"][\"array\"],\n",
    "             rate=audio_sample[\"audio\"][\"sampling_rate\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "167ed318",
   "metadata": {},
   "source": [
    "### Build the `audio classification` pipeline using 🤗 Transformers Library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c5a5820",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "from transformers import pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42b5c027-2990-4687-b560-3a4db3099c3c",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "zero_shot_classifier = pipeline(\n",
    "    task=\"zero-shot-audio-classification\",\n",
    "    model=\"./models/laion/clap-htsat-unfused\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb02a6f0",
   "metadata": {},
   "source": [
    "More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ea87bb3-6500-4558-9fee-a20bc557f753",
   "metadata": {},
   "source": [
    "### Sampling Rate for Transformer Models\n",
    "- How long does 1 second of high resolution audio (192,000 Hz) appear to the Whisper model (which is trained to expect audio files at 16,000 Hz)? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "97c904ec-2b9a-424b-bf1d-baad182d4b52",
   "metadata": {
    "height": 30
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12.0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(1 * 192000) / 16000"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "370a1cdb-e824-4d93-a846-58e13f6756c5",
   "metadata": {},
   "source": [
    "- The 1 second of high resolution audio appears to the model as if it is 12 seconds of audio."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b89285b-c79d-422a-a9cf-f323205277db",
   "metadata": {},
   "source": [
    "- How about 5 seconds of audio?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "aa97b744-6a6a-49b9-ae26-ba0ff259c0d8",
   "metadata": {
    "height": 30
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "60.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(5 * 192000) / 16000"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca1cb81b-79c3-4852-a8d6-e8e38f6a6b9b",
   "metadata": {},
   "source": [
    "- 5 seconds of high resolution audio appears to the model as if it is 60 seconds of audio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da863a69-37f0-4719-9914-d49d3b25c445",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "zero_shot_classifier.feature_extractor.sampling_rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1574ba2-b1e3-4905-8f6d-7efaac7b5196",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio_sample[\"audio\"][\"sampling_rate\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ff8ea7e",
   "metadata": {},
   "source": [
    "* Set the correct sampling rate for the input and the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c53b982",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "from datasets import Audio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd0162d7-bc3f-4de9-816d-23531df64640",
   "metadata": {
    "height": 64
   },
   "outputs": [],
   "source": [
    "dataset = dataset.cast_column(\n",
    "    \"audio\",\n",
    "     Audio(sampling_rate=48_000))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfb2c5b8",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio_sample = dataset[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed9991d3-2cc6-4289-ad51-3e6a3cf8860b",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "audio_sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "159ad36e-6c2b-4ec5-bf52-11d28f2654e4",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "candidate_labels = [\"Sound of a dog\",\n",
    "                    \"Sound of vacuum cleaner\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "690a7e3f-55f2-4a09-b73b-b5a03d393e63",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "zero_shot_classifier(audio_sample[\"audio\"][\"array\"],\n",
    "                     candidate_labels=candidate_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e373d97e-c03d-4a5d-9ca5-061f82dbacc6",
   "metadata": {
    "height": 81
   },
   "outputs": [],
   "source": [
    "candidate_labels = [\"Sound of a child crying\",\n",
    "                    \"Sound of vacuum cleaner\",\n",
    "                    \"Sound of a bird singing\",\n",
    "                    \"Sound of an airplane\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a04c213-7ab0-42b6-966b-c83c89e81b2d",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "zero_shot_classifier(audio_sample[\"audio\"][\"array\"],\n",
    "                     candidate_labels=candidate_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67d7834f",
   "metadata": {},
   "source": [
    "### Try it yourself! \n",
    "- Try this model with some other labels and audio files!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
